Importing Necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline
file_path = 'Ethereum Merged Data.csv'
ethereum_data = pd.read_csv(file_path)
ethereum_data['Timestamp'] = pd.to_datetime(ethereum_data['Timestamp'])
numeric_columns = ['Price', 'Volume', 'Market Cap'] # specify the numeric columns
ethereum_data[numeric_columns] = ethereum_data[numeric_columns].apply(pd.to_numeric, errors='coerce')
ethereum_data.fillna(ethereum_data.median(), inplace=True)
z_scores = stats.zscore(ethereum_data[numeric_columns], nan_policy='omit')
outliers = np.abs(z_scores) > 3
print("Outliers detected:\n", ethereum_data[outliers.any(axis=1)])
Outliers detected:
Timestamp Price Volume Market Cap
1786 2020-09-11 367.638929 7.474742e+10 4.130033e+10
1901 2021-01-04 967.000597 1.409065e+11 1.125254e+11
1902 2021-01-05 1025.654768 6.228514e+10 1.166932e+11
1903 2021-01-06 1103.358252 4.714825e+10 1.251129e+11
1904 2021-01-07 1208.575093 4.788685e+10 1.373068e+11
... ... ... ... ...
2240 2021-12-09 4431.540647 1.966195e+10 5.258504e+11
2394 2022-05-12 2080.910244 4.654801e+10 2.500668e+11
2427 2022-06-14 1205.595286 4.757173e+10 1.460727e+11
2692 2023-03-06 1563.225662 6.217285e+10 1.883375e+11
2700 2023-03-14 1678.915634 6.521171e+10 2.022942e+11
[85 rows x 4 columns]
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\4234391317.py:17: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version. ethereum_data.fillna(ethereum_data.median(), inplace=True)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
The following libraries are essential for data manipulation, visualization, and statistical analysis:
pandas: For data manipulation and ingestion.numpy: Provides support for efficient numerical computation.matplotlib.pyplot and seaborn: For plotting graphs that are visually appealing.scipy.stats: For statistical functions.%matplotlib inline is a magic function that renders the figure in a notebook (instead of displaying a dump of the figure object).
file_path variable holds the location of the dataset.ethereum_data = pd.read_csv(file_path): Reads the Ethereum dataset into a DataFrame.numeric_columns identifies the columns with numerical data.apply(pd.to_numeric, errors='coerce'): Converts the values in these columns to numeric types, coercing errors by replacing non-numeric values with NaN.Displays rows in the dataset that contain outliers in any of the specified numeric columns.
McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media.
Working with the OS Library and Current Working Directory
import os
current_directory = os.getcwd()
print("Current Working Directory:", current_directory)
Current Working Directory: C:\Users\Luke Holmes
Reading and Displaying Data from a CSV File
print(ethereum_data.head())
Timestamp Price Volume Market Cap 0 2015-10-21 0.439769 599041.013152 3.259030e+07 1 2015-10-22 0.565462 979304.072423 4.191854e+07 2 2015-10-23 0.540738 866798.488854 4.010011e+07 3 2015-10-24 0.568574 259157.662411 4.217907e+07 4 2015-10-25 0.631939 476617.738601 4.689593e+07
Analyzing the Structure and Basic Statistics of Ethereum Data
ethereum_data.shape
ethereum_data.describe
ethereum_data.head
<bound method NDFrame.head of Timestamp Price Volume Market Cap 0 2015-10-21 0.439769 5.990410e+05 3.259030e+07 1 2015-10-22 0.565462 9.793041e+05 4.191854e+07 2 2015-10-23 0.540738 8.667985e+05 4.010011e+07 3 2015-10-24 0.568574 2.591577e+05 4.217907e+07 4 2015-10-25 0.631939 4.766177e+05 4.689593e+07 ... ... ... ... ... 2959 2023-11-28 2030.000506 1.922688e+10 2.437204e+11 2960 2023-11-29 2048.535257 1.642457e+10 2.462628e+11 2961 2023-11-30 2025.937328 1.309906e+10 2.440015e+11 2962 2023-12-01 2051.756718 1.162592e+10 2.468482e+11 2963 2023-12-02 2085.712361 1.991308e+10 2.506407e+11 [2964 rows x 4 columns]>
ethereum_data['Timestamp'] = pd.to_datetime(ethereum_data['Timestamp'])
ethereum_data['year'] = ethereum_data['Timestamp'].dt.year
ethereum_data['month'] = ethereum_data['Timestamp'].dt.month
ethereum_data['day'] = ethereum_data['Timestamp'].dt.day
ethereum_data['weekday'] = ethereum_data['Timestamp'].dt.weekday
ethereum_data['hour'] = ethereum_data['Timestamp'].dt.hour # if hour is relevant
ethereum_data['price_volume_interaction'] = ethereum_data['Price'] * ethereum_data['Volume']
ethereum_data['marketcap_volume_ratio'] = ethereum_data['Market Cap'] / ethereum_data['Volume']
ethereum_data['price_change'] = ethereum_data['Price'].diff()
ethereum_data['volume_change'] = ethereum_data['Volume'].diff()
print(ethereum_data.head())
Timestamp Price Volume Market Cap year month day \ 0 2015-10-21 0.439769 599041.013152 3.259030e+07 2015 10 21 1 2015-10-22 0.565462 979304.072423 4.191854e+07 2015 10 22 2 2015-10-23 0.540738 866798.488854 4.010011e+07 2015 10 23 3 2015-10-24 0.568574 259157.662411 4.217907e+07 2015 10 24 4 2015-10-25 0.631939 476617.738601 4.689593e+07 2015 10 25 weekday hour price_volume_interaction marketcap_volume_ratio \ 0 2 0 263439.651980 54.404118 1 3 0 553758.756288 42.804418 2 4 0 468710.679129 46.262318 3 5 0 147350.248756 162.754481 4 6 0 301193.246187 98.393162 price_change volume_change 0 NaN NaN 1 0.125693 380263.059271 2 -0.024724 -112505.583569 3 0.027836 -607640.826444 4 0.063365 217460.076191
ed=ethereum_data
ed.dropna()
(ed.shape)
(ed.head())
(ed.tail())
| Timestamp | Price | Volume | Market Cap | year | month | day | weekday | hour | price_volume_interaction | marketcap_volume_ratio | price_change | volume_change | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2959 | 2023-11-28 | 2030.000506 | 1.922688e+10 | 2.437204e+11 | 2023 | 11 | 28 | 1 | 0 | 3.903058e+13 | 12.676024 | -34.073712 | 7.151175e+09 |
| 2960 | 2023-11-29 | 2048.535257 | 1.642457e+10 | 2.462628e+11 | 2023 | 11 | 29 | 2 | 0 | 3.364632e+13 | 14.993557 | 18.534751 | -2.802307e+09 |
| 2961 | 2023-11-30 | 2025.937328 | 1.309906e+10 | 2.440015e+11 | 2023 | 11 | 30 | 3 | 0 | 2.653788e+13 | 18.627399 | -22.597929 | -3.325512e+09 |
| 2962 | 2023-12-01 | 2051.756718 | 1.162592e+10 | 2.468482e+11 | 2023 | 12 | 1 | 4 | 0 | 2.385355e+13 | 21.232583 | 25.819390 | -1.473147e+09 |
| 2963 | 2023-12-02 | 2085.712361 | 1.991308e+10 | 2.506407e+11 | 2023 | 12 | 2 | 5 | 0 | 4.153296e+13 | 12.586734 | 33.955643 | 8.287165e+09 |
print(ed.isnull().sum())
ed.dtypes
Timestamp 0 Price 0 Volume 0 Market Cap 0 year 0 month 0 day 0 weekday 0 hour 0 price_volume_interaction 0 marketcap_volume_ratio 0 price_change 1 volume_change 1 dtype: int64
Timestamp datetime64[ns] Price float64 Volume float64 Market Cap float64 year int64 month int64 day int64 weekday int64 hour int64 price_volume_interaction float64 marketcap_volume_ratio float64 price_change float64 volume_change float64 dtype: object
Data Cleaning
ed.rename(columns={'Timestamp': 'date'}, inplace=True)
print(ed.head())
date Price Volume Market Cap year month day \ 0 2015-10-21 0.439769 599041.013152 3.259030e+07 2015 10 21 1 2015-10-22 0.565462 979304.072423 4.191854e+07 2015 10 22 2 2015-10-23 0.540738 866798.488854 4.010011e+07 2015 10 23 3 2015-10-24 0.568574 259157.662411 4.217907e+07 2015 10 24 4 2015-10-25 0.631939 476617.738601 4.689593e+07 2015 10 25 weekday hour price_volume_interaction marketcap_volume_ratio \ 0 2 0 263439.651980 54.404118 1 3 0 553758.756288 42.804418 2 4 0 468710.679129 46.262318 3 5 0 147350.248756 162.754481 4 6 0 301193.246187 98.393162 price_change volume_change 0 NaN NaN 1 0.125693 380263.059271 2 -0.024724 -112505.583569 3 0.027836 -607640.826444 4 0.063365 217460.076191
ed['year'] = pd.to_datetime(ed['date']).dt.year
ed['month'] = pd.to_datetime(ed['date']).dt.month
ed['day'] = pd.to_datetime(ed['date']).dt.day
ed['weekday'] = pd.to_datetime(ed['date']).dt.weekday
Advanced Feature Engineering for Ethereum Price Prediction
ed['price_7day_avg'] = ed['Price'].rolling(window=7).mean()
ed['volume_7day_avg'] = ed['Volume'].rolling(window=7).mean()
ed['price_change_pct'] = ed['Price'].pct_change() * 100
ed['volume_change_pct'] = ed['Volume'].pct_change() * 100
ed['market_cap_volume_ratio'] = ed['Market Cap'] / ed['Volume']
ed['price_lag1'] = ed['Price'].shift(1)
ed['volume_lag1'] = ed['Volume'].shift(1)
ed['price_ema_short'] = ed['Price'].ewm(span=12, adjust=False).mean()
ed['price_ema_long'] = ed['Price'].ewm(span=26, adjust=False).mean()
delta = ed['Price'].diff()
up, down = delta.copy(), delta.copy()
up[up < 0] = 0
down[down > 0] = 0
ed['rsi'] = 100 - (100 / (1 + up.rolling(window=14).mean() / down.abs().rolling(window=14).mean()))
ed['week_of_year'] = pd.to_datetime(ed['date']).dt.isocalendar().week
ed['quarter'] = pd.to_datetime(ed['date']).dt.quarter
ed['days_since_launch'] = (pd.to_datetime(ed['date']) - pd.to_datetime(ed['date']).min()).dt.days
ed['cumulative_return'] = (1 + ed['Price'].pct_change()).cumprod()
ed['cumulative_volume'] = ed['Volume'].cumsum()
ed['price_ma_ratio'] = ed['Price'] / ed['Price'].rolling(window=20).mean()
ed['normalized_price'] = ed['Price'] / ed['Price'].max()
print(ethereum_data.columns)
Index(['date', 'Price', 'Volume', 'Market Cap', 'year', 'month', 'day',
'weekday', 'hour', 'price_volume_interaction', 'marketcap_volume_ratio',
'price_change', 'volume_change', 'price_7day_avg', 'volume_7day_avg',
'price_change_pct', 'volume_change_pct', 'market_cap_volume_ratio',
'price_lag1', 'volume_lag1', 'price_ema_short', 'price_ema_long', 'rsi',
'week_of_year', 'quarter', 'days_since_launch', 'cumulative_return',
'cumulative_volume', 'price_ma_ratio', 'normalized_price'],
dtype='object')
ethereum_data['price_x_volume'] = ethereum_data['Price'] * ethereum_data['Volume']
ethereum_data['marketcap_per_volume'] = ethereum_data['Market Cap'] / ethereum_data['Volume']
ethereum_data['price_squared'] = ethereum_data['Price'] ** 2
ethereum_data['price_change_pct'] = ethereum_data['Price'].pct_change()
ethereum_data['price_change_pct_x_volume'] = ethereum_data['price_change_pct'] * ethereum_data['Volume']
print(ethereum_data.head())
date Price Volume Market Cap year month day \ 0 2015-10-21 0.439769 599041.013152 3.259030e+07 2015 10 21 1 2015-10-22 0.565462 979304.072423 4.191854e+07 2015 10 22 2 2015-10-23 0.540738 866798.488854 4.010011e+07 2015 10 23 3 2015-10-24 0.568574 259157.662411 4.217907e+07 2015 10 24 4 2015-10-25 0.631939 476617.738601 4.689593e+07 2015 10 25 weekday hour price_volume_interaction ... quarter days_since_launch \ 0 2 0 263439.651980 ... 4 0 1 3 0 553758.756288 ... 4 1 2 4 0 468710.679129 ... 4 2 3 5 0 147350.248756 ... 4 3 4 6 0 301193.246187 ... 4 4 cumulative_return cumulative_volume price_ma_ratio normalized_price \ 0 NaN 5.990410e+05 NaN 0.000091 1 1.285815 1.578345e+06 NaN 0.000117 2 1.229595 2.445144e+06 NaN 0.000112 3 1.292892 2.704301e+06 NaN 0.000118 4 1.436979 3.180919e+06 NaN 0.000131 price_x_volume marketcap_per_volume price_squared \ 0 263439.651980 54.404118 0.193397 1 553758.756288 42.804418 0.319747 2 468710.679129 46.262318 0.292397 3 147350.248756 162.754481 0.323276 4 301193.246187 98.393162 0.399347 price_change_pct_x_volume 0 NaN 1 279899.710740 2 -37899.132146 3 13340.871634 4 53116.946443 [5 rows x 34 columns]
ethereum_data['ma_price_x_ma_volume'] = ethereum_data['price_7day_avg'] * ethereum_data['volume_7day_avg']
ethereum_data['rsi_x_price_change_pct'] = ethereum_data['rsi'] * ethereum_data['price_change_pct']
ethereum_data['return_volume_ratio'] = ethereum_data['cumulative_return'] / ethereum_data['cumulative_volume']
print(ethereum_data[['ma_price_x_ma_volume', 'rsi_x_price_change_pct', 'return_volume_ratio']].head())
ma_price_x_ma_volume rsi_x_price_change_pct return_volume_ratio 0 NaN NaN NaN 1 NaN NaN 8.146602e-07 2 NaN NaN 5.028723e-07 3 NaN NaN 4.780873e-07 4 NaN NaN 4.517497e-07
ethereum_data['rsi_squared'] = ethereum_data['rsi'] ** 2
ethereum_data['rsi_cubed'] = ethereum_data['rsi'] ** 3
ethereum_data['rsi_squared_x_price'] = ethereum_data['rsi_squared'] * ethereum_data['Price']
ethereum_data['is_Q4'] = ethereum_data['quarter'].apply(lambda x: 1 if x == 4 else 0)
ethereum_data['is_start_of_year'] = ethereum_data['month'].apply(lambda x: 1 if x == 1 else 0)
ethereum_data['Q4_volume_change'] = ethereum_data['is_Q4'] * ethereum_data['volume_change']
high_volume_median = ethereum_data['Volume'].median()
ethereum_data['high_volume_price_change'] = ethereum_data.apply(lambda x: x['price_change'] if x['Volume'] > high_volume_median else 0, axis=1)
ed.head(50)
| date | Price | Volume | Market Cap | year | month | day | weekday | hour | price_volume_interaction | ... | ma_price_x_ma_volume | rsi_x_price_change_pct | return_volume_ratio | rsi_squared | rsi_cubed | rsi_squared_x_price | is_Q4 | is_start_of_year | Q4_volume_change | high_volume_price_change | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-10-21 | 0.439769 | 5.990410e+05 | 3.259030e+07 | 2015 | 10 | 21 | 2 | 0 | 2.634397e+05 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 0 | NaN | 0.0 |
| 1 | 2015-10-22 | 0.565462 | 9.793041e+05 | 4.191854e+07 | 2015 | 10 | 22 | 3 | 0 | 5.537588e+05 | ... | NaN | NaN | 8.146602e-07 | NaN | NaN | NaN | 1 | 0 | 3.802631e+05 | 0.0 |
| 2 | 2015-10-23 | 0.540738 | 8.667985e+05 | 4.010011e+07 | 2015 | 10 | 23 | 4 | 0 | 4.687107e+05 | ... | NaN | NaN | 5.028723e-07 | NaN | NaN | NaN | 1 | 0 | -1.125056e+05 | 0.0 |
| 3 | 2015-10-24 | 0.568574 | 2.591577e+05 | 4.217907e+07 | 2015 | 10 | 24 | 5 | 0 | 1.473502e+05 | ... | NaN | NaN | 4.780873e-07 | NaN | NaN | NaN | 1 | 0 | -6.076408e+05 | 0.0 |
| 4 | 2015-10-25 | 0.631939 | 4.766177e+05 | 4.689593e+07 | 2015 | 10 | 25 | 6 | 0 | 3.011932e+05 | ... | NaN | NaN | 4.517497e-07 | NaN | NaN | NaN | 1 | 0 | 2.174601e+05 | 0.0 |
| 5 | 2015-10-26 | 0.743958 | 1.174027e+06 | 5.522699e+07 | 2015 | 10 | 26 | 0 | 0 | 8.734265e+05 | ... | NaN | NaN | 3.884553e-07 | NaN | NaN | NaN | 1 | 0 | 6.974091e+05 | 0.0 |
| 6 | 2015-10-27 | 0.854455 | 1.887569e+06 | 6.345150e+07 | 2015 | 10 | 27 | 1 | 0 | 1.612843e+06 | ... | 5.535319e+05 | NaN | 3.112470e-07 | NaN | NaN | NaN | 1 | 0 | 7.135421e+05 | 0.0 |
| 7 | 2015-10-28 | 1.010410 | 2.447634e+06 | 7.505796e+07 | 2015 | 10 | 28 | 2 | 0 | 2.473113e+06 | ... | 8.116759e+05 | NaN | 2.643904e-07 | NaN | NaN | NaN | 1 | 0 | 5.600649e+05 | 0.0 |
| 8 | 2015-10-29 | 1.163749 | 2.236842e+06 | 8.647898e+07 | 2015 | 10 | 29 | 3 | 0 | 2.603122e+06 | ... | 1.051975e+06 | NaN | 2.421776e-07 | NaN | NaN | NaN | 1 | 0 | -2.107916e+05 | 0.0 |
| 9 | 2015-10-30 | 1.041849 | 2.384550e+06 | 7.744784e+07 | 2015 | 10 | 30 | 4 | 0 | 2.484341e+06 | ... | 1.333891e+06 | NaN | 1.779721e-07 | NaN | NaN | NaN | 1 | 0 | 1.477078e+05 | 0.0 |
| 10 | 2015-10-31 | 0.907092 | 6.522716e+05 | 6.745344e+07 | 2015 | 10 | 31 | 5 | 0 | 5.916701e+05 | ... | 1.459934e+06 | NaN | 1.477143e-07 | NaN | NaN | NaN | 1 | 0 | -1.732278e+06 | 0.0 |
| 11 | 2015-11-01 | 1.058542 | 6.039962e+05 | 7.874273e+07 | 2015 | 11 | 1 | 6 | 0 | 6.393553e+05 | ... | 1.575586e+06 | NaN | 1.652301e-07 | NaN | NaN | NaN | 1 | 0 | -4.827539e+04 | 0.0 |
| 12 | 2015-11-02 | 0.955046 | 9.706657e+05 | 7.106635e+07 | 2015 | 11 | 2 | 0 | 0 | 9.270302e+05 | ... | 1.595625e+06 | NaN | 1.397627e-07 | NaN | NaN | NaN | 1 | 0 | 3.666695e+05 | 0.0 |
| 13 | 2015-11-03 | 1.002345 | 1.878273e+06 | 7.461345e+07 | 2015 | 11 | 3 | 1 | 0 | 1.882677e+06 | ... | 1.628024e+06 | NaN | 1.308656e-07 | NaN | NaN | NaN | 1 | 0 | 9.076073e+05 | 0.0 |
| 14 | 2015-11-04 | 0.901809 | 3.218065e+06 | 6.715251e+07 | 2015 | 11 | 4 | 2 | 0 | 2.902079e+06 | ... | 1.713799e+06 | -6.632193 | 9.937778e-08 | 4372.241143 | 289105.370958 | 3942.924929 | 1 | 0 | 1.339792e+06 | 0.0 |
| 15 | 2015-11-05 | 0.906601 | 1.202885e+06 | 6.753250e+07 | 2015 | 11 | 5 | 3 | 0 | 1.090537e+06 | ... | 1.508190e+06 | 0.334806 | 9.440279e-08 | 3969.137249 | 250059.970163 | 3598.424342 | 1 | 0 | -2.015180e+06 | 0.0 |
| 16 | 2015-11-06 | 0.909442 | 9.136292e+05 | 6.776739e+07 | 2015 | 11 | 6 | 4 | 0 | 8.308929e+05 | ... | 1.279356e+06 | 0.201469 | 9.089579e-08 | 4133.197826 | 265723.086603 | 3758.904519 | 1 | 0 | -2.892562e+05 | 0.0 |
| 17 | 2015-11-07 | 0.922482 | 9.046319e+05 | 6.876115e+07 | 2015 | 11 | 7 | 5 | 0 | 8.345067e+05 | ... | 1.316602e+06 | 0.915871 | 8.867329e-08 | 4080.096967 | 260618.791701 | 3763.816383 | 1 | 0 | -8.997300e+03 | 0.0 |
| 18 | 2015-11-08 | 1.030500 | 1.040087e+06 | 7.683930e+07 | 2015 | 11 | 8 | 6 | 0 | 1.071809e+06 | ... | 1.370045e+06 | 7.622571 | 9.488461e-08 | 4237.709529 | 275865.110573 | 4366.957611 | 1 | 0 | 1.354549e+05 | 0.0 |
| 19 | 2015-11-09 | 0.995803 | 1.972521e+06 | 7.427755e+07 | 2015 | 11 | 9 | 0 | 0 | 1.964243e+06 | ... | 1.514824e+06 | -2.024678 | 8.490811e-08 | 3616.057032 | 217446.743069 | 3600.880384 | 1 | 0 | 9.324346e+05 | 0.0 |
| 20 | 2015-11-10 | 0.934834 | 8.650315e+05 | 6.975349e+07 | 2015 | 11 | 10 | 1 | 0 | 8.086611e+05 | ... | 1.362982e+06 | -3.267527 | 7.720529e-08 | 2848.199618 | 152004.216716 | 2662.594474 | 1 | 0 | -1.107490e+06 | 0.0 |
| 21 | 2015-11-11 | 0.788761 | 1.243323e+06 | 5.887388e+07 | 2015 | 11 | 11 | 2 | 0 | 9.806847e+05 | ... | 1.078152e+06 | -6.349221 | 6.232706e-08 | 1651.086634 | 67089.536649 | 1302.313037 | 1 | 0 | 3.782912e+05 | 0.0 |
| 22 | 2015-11-12 | 0.900742 | 8.268297e+05 | 6.725489e+07 | 2015 | 11 | 12 | 3 | 0 | 7.447603e+05 | ... | 1.027427e+06 | 5.463497 | 6.918774e-08 | 1480.964306 | 56992.392294 | 1333.966850 | 1 | 0 | -4.164931e+05 | 0.0 |
| 23 | 2015-11-13 | 0.904082 | 5.450863e+05 | 6.752726e+07 | 2015 | 11 | 13 | 4 | 0 | 4.928024e+05 | ... | 9.778608e+05 | 0.160417 | 6.818871e-08 | 1872.153919 | 81005.093415 | 1692.579832 | 1 | 0 | -2.817434e+05 | 0.0 |
| 24 | 2015-11-14 | 0.884229 | 3.615861e+05 | 6.606671e+07 | 2015 | 11 | 14 | 5 | 0 | 3.197249e+05 | ... | 9.007257e+05 | -1.070312 | 6.590098e-08 | 2375.740770 | 115797.338042 | 2100.698660 | 1 | 0 | -1.835001e+05 | 0.0 |
| 25 | 2015-11-15 | 0.910826 | 4.320234e+05 | 6.807833e+07 | 2015 | 11 | 15 | 6 | 0 | 3.934980e+05 | ... | 8.055660e+05 | 1.220412 | 6.693542e-08 | 1646.221972 | 66793.252323 | 1499.421030 | 1 | 0 | 7.043729e+04 | 0.0 |
| 26 | 2015-11-16 | 0.933841 | 6.163469e+05 | 6.981283e+07 | 2015 | 11 | 16 | 0 | 0 | 5.755699e+05 | ... | 6.244834e+05 | 1.225320 | 6.728649e-08 | 2351.466388 | 114027.121982 | 2195.895215 | 1 | 0 | 1.843235e+05 | 0.0 |
| 27 | 2015-11-17 | 0.995273 | 1.128250e+06 | 7.444073e+07 | 2015 | 11 | 17 | 1 | 0 | 1.122917e+06 | ... | 6.644529e+05 | 3.256810 | 6.923762e-08 | 2450.942894 | 121338.826016 | 2439.358164 | 1 | 0 | 5.119026e+05 | 0.0 |
| 28 | 2015-11-18 | 0.994429 | 6.872033e+05 | 7.440272e+07 | 2015 | 11 | 18 | 2 | 0 | 6.833746e+05 | ... | 6.120467e+05 | -0.048805 | 6.775441e-08 | 3306.214627 | 190106.324049 | 3287.794316 | 1 | 0 | -4.410462e+05 | 0.0 |
| 29 | 2015-11-19 | 0.951471 | 4.343855e+05 | 7.121230e+07 | 2015 | 11 | 19 | 3 | 0 | 4.133053e+05 | ... | 5.641534e+05 | -2.307712 | 6.399463e-08 | 2853.885945 | 152459.650199 | 2715.390540 | 1 | 0 | -2.528179e+05 | 0.0 |
| 30 | 2015-11-20 | 0.926803 | 6.065603e+05 | 6.938837e+07 | 2015 | 11 | 20 | 4 | 0 | 5.621620e+05 | ... | 5.743795e+05 | -1.329536 | 6.123683e-08 | 2629.769695 | 134857.956504 | 2437.278756 | 1 | 0 | 1.721749e+05 | 0.0 |
| 31 | 2015-11-21 | 0.973572 | 4.549269e+05 | 7.291437e+07 | 2015 | 11 | 21 | 5 | 0 | 4.429043e+05 | ... | 5.948952e+05 | 2.704405 | 6.348780e-08 | 2872.078468 | 153919.786275 | 2796.176426 | 1 | 0 | -1.516334e+05 | 0.0 |
| 32 | 2015-11-22 | 0.964739 | 3.719629e+05 | 7.227634e+07 | 2015 | 11 | 22 | 6 | 0 | 3.588469e+05 | ... | 5.914305e+05 | -0.404938 | 6.224772e-08 | 1991.625162 | 88881.506412 | 1921.397536 | 1 | 0 | -8.296404e+04 | 0.0 |
| 33 | 2015-11-23 | 0.945220 | 4.371264e+05 | 7.083920e+07 | 2015 | 11 | 23 | 0 | 0 | 4.131808e+05 | ... | 5.677350e+05 | -0.925849 | 6.024115e-08 | 2094.210005 | 95836.367674 | 1979.489885 | 1 | 0 | 6.516351e+04 | 0.0 |
| 34 | 2015-11-24 | 0.899361 | 3.549141e+05 | 6.742430e+07 | 2015 | 11 | 24 | 1 | 0 | 3.191959e+05 | ... | 4.546287e+05 | -2.277925 | 5.675388e-08 | 2204.408532 | 103499.469795 | 1982.559361 | 1 | 0 | -8.221231e+04 | 0.0 |
| 35 | 2015-11-25 | 0.867707 | 7.088672e+05 | 6.507285e+07 | 2015 | 11 | 25 | 2 | 0 | 6.150891e+05 | ... | 4.488592e+05 | -2.057096 | 5.369997e-08 | 3416.007675 | 199654.110874 | 2964.094085 | 1 | 0 | 3.539531e+05 | 0.0 |
| 36 | 2015-11-26 | 0.894875 | 9.596965e+05 | 6.713206e+07 | 2015 | 11 | 26 | 3 | 0 | 8.588082e+05 | ... | 5.143551e+05 | 1.541474 | 5.397161e-08 | 2423.893373 | 119335.667578 | 2169.081027 | 1 | 0 | 2.508293e+05 | 0.0 |
| 37 | 2015-11-27 | 0.870713 | 3.985955e+05 | 6.534185e+07 | 2015 | 11 | 27 | 4 | 0 | 3.470622e+05 | ... | 4.826661e+05 | -1.238329 | 5.196498e-08 | 2103.447781 | 96471.182592 | 1831.498907 | 1 | 0 | -5.611010e+05 | 0.0 |
| 38 | 2015-11-28 | 0.917399 | 4.650649e+05 | 6.886877e+07 | 2015 | 11 | 28 | 5 | 0 | 4.266500e+05 | ... | 4.797563e+05 | 2.887629 | 5.409101e-08 | 2900.412988 | 156203.140755 | 2660.835324 | 1 | 0 | 6.646944e+04 | 0.0 |
| 39 | 2015-11-29 | 0.871745 | 4.369391e+05 | 6.546334e+07 | 2015 | 11 | 29 | 6 | 0 | 3.808995e+05 | ... | 4.810518e+05 | -2.271756 | 5.082339e-08 | 2083.940110 | 95132.267949 | 1816.664275 | 1 | 0 | -2.812581e+04 | 0.0 |
| 40 | 2015-11-30 | 0.873601 | 7.753725e+05 | 6.562485e+07 | 2015 | 11 | 30 | 0 | 0 | 6.773662e+05 | ... | 5.183211e+05 | 0.091471 | 4.993883e-08 | 1845.876779 | 79305.637343 | 1612.559675 | 1 | 0 | 3.384334e+05 | 0.0 |
| 41 | 2015-12-01 | 0.875004 | 6.431099e+05 | 6.575249e+07 | 2015 | 12 | 1 | 1 | 0 | 5.627237e+05 | ... | 5.525786e+05 | 0.054061 | 4.922323e-08 | 1133.031732 | 38138.456233 | 991.407295 | 1 | 0 | -1.322627e+05 | 0.0 |
| 42 | 2015-12-02 | 0.822734 | 4.942059e+05 | 6.184478e+07 | 2015 | 12 | 2 | 2 | 0 | 4.065999e+05 | ... | 5.217142e+05 | -1.764262 | 4.572375e-08 | 872.243691 | 25760.646354 | 717.624350 | 1 | 0 | -1.489040e+05 | 0.0 |
| 43 | 2015-12-03 | 0.824765 | 5.416811e+05 | 6.201881e+07 | 2015 | 12 | 3 | 3 | 0 | 4.467597e+05 | ... | 4.640805e+05 | 0.082131 | 4.523775e-08 | 1106.462467 | 36804.848095 | 912.571725 | 1 | 0 | 4.747524e+04 | 0.0 |
| 44 | 2015-12-04 | 0.838791 | 2.419475e+05 | 6.309433e+07 | 2015 | 12 | 4 | 4 | 0 | 2.029434e+05 | ... | 4.423760e+05 | 0.646870 | 4.574011e-08 | 1446.909749 | 55037.939230 | 1213.654805 | 1 | 0 | -2.997336e+05 | 0.0 |
| 45 | 2015-12-05 | 0.864584 | 2.260844e+05 | 6.505634e+07 | 2015 | 12 | 5 | 5 | 0 | 1.954689e+05 | ... | 4.093750e+05 | 1.054467 | 4.689238e-08 | 1175.919488 | 40324.257731 | 1016.680857 | 1 | 0 | -1.586308e+04 | 0.0 |
| 46 | 2015-12-06 | 0.834992 | 4.296813e+05 | 6.285026e+07 | 2015 | 12 | 6 | 6 | 0 | 3.587806e+05 | ... | 4.059763e+05 | -1.107406 | 4.482802e-08 | 1046.891522 | 33872.911238 | 874.146555 | 1 | 0 | 2.035969e+05 | 0.0 |
| 47 | 2015-12-07 | 0.800750 | 4.946576e+05 | 6.029248e+07 | 2015 | 12 | 7 | 0 | 0 | 3.960972e+05 | ... | 3.674121e+05 | -1.275784 | 4.249339e-08 | 967.823226 | 30108.842548 | 774.984723 | 1 | 0 | 6.497633e+04 | 0.0 |
| 48 | 2015-12-08 | 0.818853 | 4.359668e+05 | 6.167586e+07 | 2015 | 12 | 8 | 1 | 0 | 3.569926e+05 | ... | 3.393504e+05 | 0.873737 | 4.301638e-08 | 1493.762873 | 57732.782770 | 1223.171775 | 1 | 0 | -5.869081e+04 | 0.0 |
| 49 | 2015-12-09 | 0.791829 | 6.271289e+05 | 5.966107e+07 | 2015 | 12 | 9 | 2 | 0 | 4.965791e+05 | ... | 3.532086e+05 | -1.292357 | 4.100272e-08 | 1533.549977 | 60054.686047 | 1214.309879 | 1 | 0 | 1.911621e+05 | 0.0 |
50 rows × 44 columns
This section of the code demonstrates the creation of various advanced features derived from the basic columns in the ethereum_data DataFrame. These features aim to capture trends, cyclic behavior, and other complex relationships within the data that could be useful for predicting Ethereum prices.
```python
ed['price_7day_avg'] = ed['Price'].rolling(window=7).mean() ed['volume_7day_avg'] = ed['Volume'].rolling(window=7).mean()
ed['price_change_pct'] = ed['Price'].pct_change() 100 ed['volume_change_pct'] = ed['Volume'].pct_change() 100
ed['market_cap_volume_ratio'] = ed['Market Cap'] / ed['Volume']
ed['price_lag1'] = ed['Price'].shift(1) ed['volume_lag1'] = ed['Volume'].shift(1)
ed['price_ema_short'] = ed['Price'].ewm(span=12, adjust=False).mean()
ed['price_ema_long'] = ed['Price'].ewm(span=26, adjust=False).mean()
delta = ed['Price'].diff() up, down = delta.copy(), delta.copy() up[up < 0] = 0 down[down > 0] = 0 ed['rsi'] = 100 - (100 / (1 + up.rolling(window=14).mean() / down.abs().rolling(window=14).mean()))
ed['week_of_year'] = pd.to_datetime(ed['date']).dt.isocalendar().week ed['quarter'] = pd.to_datetime(ed['date']).dt.quarter ed['days_since_launch'] = (pd.to_datetime(ed['date']) - pd.to_datetime(ed['date']).min()).dt.days
ed['cumulative_return'] = (1 + ed['Price'].pct_change()).cumprod() ed['cumulative_volume'] = ed['Volume'].cumsum()
ed['price_ma_ratio'] = ed['Price'] / ed['Price'].rolling(window=20).mean()
ed['normalized_price'] = ed['Price'] / ed['Price'].max()
print(ethereum_data.columns)
LTSM Model
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
ethereum_data['date'] = pd.to_datetime(ethereum_data['date'])
ethereum_data.sort_values('date', inplace=True)
features = ethereum_data[['Price', 'Volume', 'Market Cap']]
target = ethereum_data['Price']
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_features = scaler.fit_transform(features)
scaled_target = scaler.fit_transform(target.values.reshape(-1, 1))
def create_dataset(X, y, time_step=1):
Xs, ys = [], []
for i in range(len(X) - time_step):
v = X[i:(i + time_step)]
Xs.append(v)
ys.append(y[i + time_step])
return np.array(Xs), np.array(ys)
time_step = 10
X, y = create_dataset(scaled_features, scaled_target, time_step)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
lstm_units = 50
dropout_rate = 0.2
model = Sequential()
model.add(LSTM(units=lstm_units, return_sequences=True, input_shape=(time_step, X.shape[2])))
model.add(LSTM(units=lstm_units))
model.add(Dropout(rate=dropout_rate))
model.add(Dense(units=1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(**kwargs)
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ lstm_2 (LSTM) │ (None, 10, 50) │ 10,800 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ lstm_3 (LSTM) │ (None, 50) │ 20,200 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_1 (Dropout) │ (None, 50) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 1) │ 51 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 31,051 (121.29 KB)
Trainable params: 31,051 (121.29 KB)
Non-trainable params: 0 (0.00 B)
numpy and pandas are used for data manipulation. MinMaxScaler from sklearn.preprocessing is used to scale the data, which helps in normalizing the input features/labels within a bounded range and is generally a good practice for neural network algorithms. Sequential from tensorflow.keras.models and layers like LSTM, Dense, and Dropout from tensorflow.keras.layers are used to build the LSTM model.
ethereum_data['date'] = pd.to_datetime(ethereum_data['date'])
ethereum_data.sort_values('date', inplace=True)
features = ethereum_data[['Price', 'Volume', 'Market Cap']]
target = ethereum_data['Price']
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_features = scaler.fit_transform(features)
scaled_target = scaler.fit_transform(target.values.reshape(-1, 1))
Converts the 'date' column to a datetime object and sorts the DataFrame based on date to ensure that the sequence is in chronological order. Extracts relevant features and the target variable ('Price') for model training. Scales the features and target variable between 0 and 1 to facilitate faster convergence during training.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(time_step, X.shape[2])))
model.add(LSTM(50))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ lstm_4 (LSTM) │ (None, 10, 50) │ 10,800 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ lstm_5 (LSTM) │ (None, 50) │ 20,200 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_2 (Dropout) │ (None, 50) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_2 (Dense) │ (None, 1) │ 51 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 31,051 (121.29 KB)
Trainable params: 31,051 (121.29 KB)
Non-trainable params: 0 (0.00 B)
Splits the data into training and testing sets. Constructs a Sequential LSTM model with two LSTM layers to capture complex relationships in the data. A Dropout layer is included to prevent overfitting, followed by a Dense output layer to predict the continuous value (Ethereum price). The model uses 'mean_squared_error' as the loss function and 'adam' optimizer, which is commonly used in regression problems like this.
from optuna import Trial
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense, Input
from tensorflow.keras.optimizers import Adam
def build_model(trial: Trial):
tf.keras.backend.clear_session()
model = Sequential([
Input(shape=(time_step, X.shape[2])),
LSTM(trial.suggest_categorical('lstm_units', [50, 100, 150]), return_sequences=True),
LSTM(trial.suggest_categorical('lstm_units', [50, 100, 150])),
Dropout(trial.suggest_float('dropout_rate', 0.1, 0.5)),
Dense(1)
])
lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
model.compile(optimizer=Adam(learning_rate=lr), loss='mean_squared_error')
return model
def objective(trial: Trial):
model = build_model(trial)
history = model.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=64, verbose=0)
loss = model.evaluate(X_test, y_test, verbose=0)
return loss
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50, timeout=600)
print("Best trial:")
trial = study.best_trial
print(f"Value: {trial.value}")
print("Params: ")
for key, value in trial.params.items():
print(f"{key}: {value}")
[I 2024-05-01 22:21:15,208] A new study created in memory with name: no-name-ecf34e47-17a2-40ab-a0e6-17ff04acab20
[I 2024-05-01 22:22:32,664] Trial 0 finished with value: 0.0007128661382012069 and parameters: {'lstm_units': 150, 'dropout_rate': 0.3900353930656907, 'lr': 0.006665058671399883}. Best is trial 0 with value: 0.0007128661382012069.
[I 2024-05-01 22:23:38,356] Trial 1 finished with value: 0.0005467506707645953 and parameters: {'lstm_units': 100, 'dropout_rate': 0.48366893947431777, 'lr': 0.00017547692320694214}. Best is trial 1 with value: 0.0005467506707645953.
[I 2024-05-01 22:24:22,317] Trial 2 finished with value: 0.02268524281680584 and parameters: {'lstm_units': 100, 'dropout_rate': 0.12746841967209088, 'lr': 0.04212107130866591}. Best is trial 1 with value: 0.0005467506707645953.
[I 2024-05-01 22:25:17,107] Trial 3 finished with value: 0.002532100537791848 and parameters: {'lstm_units': 100, 'dropout_rate': 0.1736868828938726, 'lr': 1.1849615328748276e-05}. Best is trial 1 with value: 0.0005467506707645953.
[I 2024-05-01 22:26:05,849] Trial 4 finished with value: 0.0018641131464391947 and parameters: {'lstm_units': 100, 'dropout_rate': 0.1397727728687151, 'lr': 2.2219403053770703e-05}. Best is trial 1 with value: 0.0005467506707645953.
[I 2024-05-01 22:26:52,823] Trial 5 finished with value: 0.0005035395734012127 and parameters: {'lstm_units': 100, 'dropout_rate': 0.3717713526400662, 'lr': 0.0003587859917045765}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:27:14,318] Trial 6 finished with value: 0.013612092472612858 and parameters: {'lstm_units': 50, 'dropout_rate': 0.19570302507032103, 'lr': 0.054390304448723954}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:27:35,364] Trial 7 finished with value: 0.0006943660555407405 and parameters: {'lstm_units': 50, 'dropout_rate': 0.42498911739570355, 'lr': 0.00024626257778208375}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:28:49,605] Trial 8 finished with value: 0.00172930839471519 and parameters: {'lstm_units': 150, 'dropout_rate': 0.12686054201625507, 'lr': 1.1743802485863199e-05}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:29:30,586] Trial 9 finished with value: 0.0008990837959572673 and parameters: {'lstm_units': 100, 'dropout_rate': 0.482405133666325, 'lr': 6.92196342074409e-05}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:29:53,351] Trial 10 finished with value: 0.0006328715244308114 and parameters: {'lstm_units': 50, 'dropout_rate': 0.3017412034512606, 'lr': 0.001807707131274098}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:30:39,398] Trial 11 finished with value: 0.000498659152071923 and parameters: {'lstm_units': 100, 'dropout_rate': 0.3452337125154388, 'lr': 0.0006220505444542745}. Best is trial 11 with value: 0.000498659152071923.
[I 2024-05-01 22:31:24,501] Trial 12 finished with value: 0.0009540743776597083 and parameters: {'lstm_units': 100, 'dropout_rate': 0.32659224940013326, 'lr': 0.0008798325039893869}. Best is trial 11 with value: 0.000498659152071923.
Best trial: Value: 0.000498659152071923 Params: lstm_units: 100 dropout_rate: 0.3452337125154388 lr: 0.0006220505444542745
Optuna is a hyperparameter optimization library to automate hyperparameter search. TensorFlow and its Keras API are used to build and train the LSTM model. Adam optimizer is commonly used due to its efficiency in handling sparse gradients on noisy problems. The build_model function constructs a neural network based on the hyperparameters suggested by Optuna during the trials. It clears any existing TensorFlow backend session to ensure that the model starts with a clean state on each trial. The LSTM layers are dynamically configured to have units based on trial suggestions, which can significantly impact learning capacity and performance. Dropout is used to randomly set a fraction of input units to 0 at each update during training time, which helps to prevent overfitting.
Defining the Objective Function
The objective function orchestrates the training and evaluation process. It takes a trial object which provides suggestions for the hyperparameters and returns the loss of the model on the test set, which Optuna tries to minimize.
Running the Optimization
A study object is created with the goal to 'minimize' the loss. Optuna supports both minimization and maximization. The optimize method of the study object runs the optimization for a defined number of trials (n_trials) or until a certain time limit is reached (timeout).
Displaying the Best Trial
After the optimization, the best trial is accessed via study.best_trial. It displays the best performance observed during the optimization and the hyperparameters that led to that performance.
Conclusion
Using Optuna for hyperparameter optimization can lead to significant improvements in model performance by systematically searching through multiple combinations of hyperparameter settings. This process is crucial for fine-tuning deep learning models and can be particularly beneficial in complex domains such as financial time series forecasting.
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64, verbose=1)
Epoch 1/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/step - loss: 0.0348 - val_loss: 0.0016 Epoch 2/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0017 - val_loss: 6.6941e-04 Epoch 3/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0016 - val_loss: 0.0011 Epoch 4/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0012 - val_loss: 9.7776e-04 Epoch 5/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0013 - val_loss: 8.9295e-04 Epoch 6/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0013 - val_loss: 6.4354e-04 Epoch 7/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 0.0013 - val_loss: 5.5434e-04 Epoch 8/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 0.0012 - val_loss: 9.5105e-04 Epoch 9/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 2s 44ms/step - loss: 0.0011 - val_loss: 0.0021 Epoch 10/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 3s 51ms/step - loss: 0.0013 - val_loss: 5.7144e-04 Epoch 11/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0011 - val_loss: 0.0011 Epoch 12/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 9.1258e-04 - val_loss: 6.4081e-04 Epoch 13/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0010 - val_loss: 6.6901e-04 Epoch 14/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0011 - val_loss: 6.2649e-04 Epoch 15/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 9.9424e-04 - val_loss: 0.0016 Epoch 16/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 0.0011 - val_loss: 4.5030e-04 Epoch 17/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0011 - val_loss: 0.0017 Epoch 18/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 9.3172e-04 - val_loss: 9.9930e-04 Epoch 19/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 9.7901e-04 - val_loss: 4.6408e-04 Epoch 20/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 9.0177e-04 - val_loss: 7.9490e-04 Epoch 21/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 9.5421e-04 - val_loss: 8.0474e-04 Epoch 22/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0011 - val_loss: 5.7729e-04 Epoch 23/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.0903e-04 - val_loss: 0.0017 Epoch 24/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 8.8002e-04 - val_loss: 0.0017 Epoch 25/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 8.5928e-04 - val_loss: 8.5455e-04 Epoch 26/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.9640e-04 - val_loss: 0.0013 Epoch 27/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 7.4214e-04 - val_loss: 3.7975e-04 Epoch 28/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 8.6599e-04 - val_loss: 6.0650e-04 Epoch 29/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 8.6641e-04 - val_loss: 0.0013 Epoch 30/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 8.0409e-04 - val_loss: 0.0011 Epoch 31/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 7.2854e-04 - val_loss: 3.1957e-04 Epoch 32/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 2s 49ms/step - loss: 7.3699e-04 - val_loss: 7.3082e-04 Epoch 33/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 6.1679e-04 - val_loss: 3.6806e-04 Epoch 34/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 7.9721e-04 - val_loss: 3.0742e-04 Epoch 35/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 39ms/step - loss: 7.5590e-04 - val_loss: 2.9894e-04 Epoch 36/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 6.8833e-04 - val_loss: 7.4489e-04 Epoch 37/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 7.0217e-04 - val_loss: 0.0011 Epoch 38/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 7.3997e-04 - val_loss: 0.0011 Epoch 39/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 7.8344e-04 - val_loss: 0.0013 Epoch 40/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 8.2099e-04 - val_loss: 3.4394e-04 Epoch 41/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 8.1899e-04 - val_loss: 2.7239e-04 Epoch 42/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 7.4848e-04 - val_loss: 2.9717e-04 Epoch 43/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.0904e-04 - val_loss: 3.7375e-04 Epoch 44/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.2711e-04 - val_loss: 3.4271e-04 Epoch 45/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.9872e-04 - val_loss: 5.4607e-04 Epoch 46/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 7.8388e-04 - val_loss: 3.1085e-04 Epoch 47/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 8.0666e-04 - val_loss: 0.0010 Epoch 48/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 5.6177e-04 - val_loss: 2.8615e-04 Epoch 49/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.2094e-04 - val_loss: 5.3703e-04 Epoch 50/50 37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 5.6650e-04 - val_loss: 5.6847e-04
<keras.src.callbacks.history.History at 0x23599032e90>
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import numpy as np
def train_and_evaluate_model(X_train, y_train, X_test, y_test, lstm_units, num_layers, epochs=20, batch_size=64):
model = Sequential()
model.add(LSTM(lstm_units, return_sequences=(num_layers > 1), input_shape=(X_train.shape[1], X_train.shape[2])))
for i in range(num_layers - 1):
model.add(LSTM(lstm_units, return_sequences=(i < num_layers - 2)))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test), verbose=0, shuffle=False)
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
train_rmse = np.sqrt(mean_squared_error(y_train, train_predict))
test_rmse = np.sqrt(mean_squared_error(y_test, test_predict))
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.title(f'Training and Validation loss (Units: {lstm_units}, Layers: {num_layers})')
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import numpy as np
def train_and_evaluate_model(X_train, y_train, X_test, y_test, lstm_units, dropout_rate, learning_rate, epochs=50, batch_size=64):
model = Sequential()
model.add(LSTM(lstm_units, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(dropout_rate))
model.add(LSTM(lstm_units))
model.add(Dropout(dropout_rate))
model.add(Dense(1))
optimizer = Adam(learning_rate=learning_rate)
model.compile(loss='mean_squared_error', optimizer=optimizer)
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test), verbose=1, shuffle=False)
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
train_rmse = np.sqrt(mean_squared_error(y_train, train_predict))
test_rmse = np.sqrt(mean_squared_error(y_test, test_predict))
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Test Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()
return train_rmse, test_rmse
lstm_units = 100 # From Optuna
dropout_rate = 0.3452337125154388 # From Optuna
learning_rate = 0.0006220505444542745 # From Optuna
train_rmse, test_rmse = train_and_evaluate_model(
X_train, y_train, X_test, y_test,
lstm_units=lstm_units,
dropout_rate=dropout_rate,
learning_rate=learning_rate
)
print(f"Train RMSE: {train_rmse}, Test RMSE: {test_rmse}")
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step 19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Units: 50, Layers: 1 => Train RMSE: 0.03512533177352272, Test RMSE: 0.020643031084390882
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step 19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Units: 50, Layers: 2 => Train RMSE: 0.0637469685834429, Test RMSE: 0.028491347000281594
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step 19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Units: 50, Layers: 3 => Train RMSE: 0.046748635757175284, Test RMSE: 0.026319640058699755
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step 19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Units: 100, Layers: 1 => Train RMSE: 0.019572349799368467, Test RMSE: 0.021077017916277595
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step 19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Units: 100, Layers: 2 => Train RMSE: 0.07661178268558169, Test RMSE: 0.023474108356920143
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step 19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Units: 100, Layers: 3 => Train RMSE: 0.04039341945663431, Test RMSE: 0.034419499330793774
Explanation: Model Architecture: This setup uses a two-layer LSTM model with dropout after each LSTM layer. The return_sequences=True parameter in the first LSTM layer allows for sequence generation necessary for the following LSTM layer input. Optimizer Configuration: Uses the Adam optimizer with a learning rate specified by Optuna's results. Training and Evaluation: The model is trained using the given training set and evaluated on the test set, with a verbose output to track progress visually during training. Visualization: Plots the training and test loss over epochs to visualize the model's learning progress.
Next, let's conduct a correlation analysis to identify potential linear relationships between different features within the Ethereum dataset. This can help in understanding the dependencies between different variables.
We'll use seaborn for the heatmap and numpy for numerical operations.
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
train_predict = scaler.inverse_transform(train_predict)
test_predict = scaler.inverse_transform(test_predict)
actual_y_train = scaler.inverse_transform(y_train)
actual_y_test = scaler.inverse_transform(y_test)
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(actual_y_test, label='Actual Price')
plt.plot(test_predict, label='Predicted Price', alpha=0.7)
plt.title('Ethereum Price Prediction')
plt.xlabel('Time')
plt.ylabel('Ethereum Price')
plt.legend()
plt.show()
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step 19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Next, let's conduct a correlation analysis to identify potential linear relationships between different features within the Ethereum dataset. This can help in understanding the dependencies between different variables.
We'll use seaborn for the heatmap and numpy for numerical operations.
import seaborn as sns
import matplotlib.pyplot as plt
corr = ethereum_data.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(15, 10))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm', cbar_kws={"shrink": .5})
plt.title('Correlation Matrix Heatmap')
plt.show()
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\1223983659.py:4: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. corr = ethereum_data.corr()
import plotly.graph_objects as go
corr = ethereum_data.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
corr_masked = corr.where(~mask, None)
fig = go.Figure(data=go.Heatmap(
z=corr_masked,
x=corr.columns,
y=corr.columns,
colorscale='RdBu',
zmin=-1,
zmax=1
))
fig.update_layout(
title='Correlation Matrix Heatmap',
width=800,
height=800,
xaxis_showgrid=False,
yaxis_showgrid=False,
yaxis_autorange='reversed'
)
fig.show()
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\3844861659.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. corr = ethereum_data.corr()
Initially, we identify any missing values across each column of the dataset. Then, we fill these missing values with the median of their respective columns, which is a common practice to avoid bias caused by outliers.
print("NaN counts in each column:\n", ethereum_data.isnull().sum())
ethereum_data.fillna(ethereum_data.median(), inplace=True)
print("NaN counts after handling:\n", ethereum_data.isnull().sum())
NaN counts in each column: date 0 Price 0 Volume 0 Market Cap 0 year 0 month 0 day 0 weekday 0 hour 0 price_volume_interaction 0 marketcap_volume_ratio 0 price_change 1 volume_change 1 price_7day_avg 6 volume_7day_avg 6 price_change_pct 1 volume_change_pct 1 market_cap_volume_ratio 0 price_lag1 1 volume_lag1 1 price_ema_short 0 price_ema_long 0 rsi 14 week_of_year 0 quarter 0 days_since_launch 0 cumulative_return 1 cumulative_volume 0 price_ma_ratio 19 normalized_price 0 price_x_volume 0 marketcap_per_volume 0 price_squared 0 price_change_pct_x_volume 1 ma_price_x_ma_volume 6 rsi_x_price_change_pct 14 return_volume_ratio 1 rsi_squared 14 rsi_cubed 14 rsi_squared_x_price 14 is_Q4 0 is_start_of_year 0 Q4_volume_change 1 high_volume_price_change 0 dtype: int64 NaN counts after handling: date 0 Price 0 Volume 0 Market Cap 0 year 0 month 0 day 0 weekday 0 hour 0 price_volume_interaction 0 marketcap_volume_ratio 0 price_change 0 volume_change 0 price_7day_avg 0 volume_7day_avg 0 price_change_pct 0 volume_change_pct 0 market_cap_volume_ratio 0 price_lag1 0 volume_lag1 0 price_ema_short 0 price_ema_long 0 rsi 0 week_of_year 0 quarter 0 days_since_launch 0 cumulative_return 0 cumulative_volume 0 price_ma_ratio 0 normalized_price 0 price_x_volume 0 marketcap_per_volume 0 price_squared 0 price_change_pct_x_volume 0 ma_price_x_ma_volume 0 rsi_x_price_change_pct 0 return_volume_ratio 0 rsi_squared 0 rsi_cubed 0 rsi_squared_x_price 0 is_Q4 0 is_start_of_year 0 Q4_volume_change 0 high_volume_price_change 0 dtype: int64
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\2861788634.py:2: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version. C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\2861788634.py:2: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
Columns with zero variance (i.e., the same value across all entries) do not contribute to the model's predictive capability and are therefore removed from the dataset.
zero_var_cols = ethereum_data.columns[ethereum_data.nunique() <= 1]
print("Columns with zero variance:", zero_var_cols)
ethereum_data.drop(columns=zero_var_cols, inplace=True)
Columns with zero variance: Index(['hour'], dtype='object')
Next, we replace infinite values within our numeric columns with NaN, which allows for more consistent data filling. Following this, we refill these values using the median. We then attempt to calculate the Z-scores to identify outliers, defined as those beyond 3 standard deviations from the mean.
import numpy as np
from scipy import stats
numeric_cols = ethereum_data.select_dtypes(include=[np.number]).columns
ethereum_data[numeric_cols] = ethereum_data[numeric_cols].replace([np.inf, -np.inf], np.nan)
ethereum_data[numeric_cols] = ethereum_data[numeric_cols].fillna(ethereum_data[numeric_cols].median())
try:
z_scores = np.abs(stats.zscore(ethereum_data[numeric_cols]))
outliers_z = (z_scores > 3).any(axis=1)
print("Detected outliers by Z-Score:", ethereum_data[outliers_z].shape[0])
except Exception as e:
print("Error recalculating Z-scores:", str(e))
for column in numeric_cols:
try:
z_score = np.abs(stats.zscore(ethereum_data[column].dropna()))
outlier = (z_score > 3)
print(f"Outliers in {column}: {outlier.sum()}")
except Exception as e:
print(f"Error processing column {column}: {str(e)}")
Error recalculating Z-scores: loop of ufunc does not support argument 0 of type float which has no callable sqrt method Outliers in Price: 37 Outliers in Volume: 48 Outliers in Market Cap: 36 Outliers in year: 0 Outliers in month: 0 Outliers in day: 0 Outliers in weekday: 0 Outliers in price_volume_interaction: 49 Outliers in marketcap_volume_ratio: 3 Outliers in price_change: 72 Outliers in volume_change: 42 Outliers in price_7day_avg: 43 Outliers in volume_7day_avg: 45 Outliers in price_change_pct: 50 Outliers in volume_change_pct: 3 Outliers in market_cap_volume_ratio: 3 Outliers in price_lag1: 37 Outliers in volume_lag1: 48 Outliers in price_ema_short: 41 Outliers in price_ema_long: 40 Outliers in rsi: 0 Error processing column week_of_year: loop of ufunc does not support argument 0 of type float which has no callable sqrt method Outliers in quarter: 0 Outliers in days_since_launch: 0 Outliers in cumulative_return: 37 Outliers in cumulative_volume: 0 Outliers in price_ma_ratio: 49 Outliers in normalized_price: 37 Outliers in price_x_volume: 49 Outliers in marketcap_per_volume: 3 Outliers in price_squared: 94 Outliers in price_change_pct_x_volume: 41 Outliers in ma_price_x_ma_volume: 40 Outliers in rsi_x_price_change_pct: 67 Outliers in return_volume_ratio: 18 Outliers in rsi_squared: 13 Outliers in rsi_cubed: 38 Outliers in rsi_squared_x_price: 94 Outliers in is_Q4: 0 Outliers in is_start_of_year: 248 Outliers in Q4_volume_change: 61 Outliers in high_volume_price_change: 75
Finally, we analyze outliers in each numeric column individually to understand the distribution and identify extreme values that could affect the performance of predictive models.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize
numeric_cols = ethereum_data.select_dtypes(include=[np.number])
numeric_cols.replace([np.inf, -np.inf], np.nan, inplace=True)
numeric_cols.dropna(inplace=True)
std_devs = numeric_cols.std()
print("Standard deviations of numeric columns:", std_devs)
valid_cols = std_devs[std_devs > 0].index
valid_numeric_data = numeric_cols[valid_cols].astype(float)
try:
z_scores = np.abs(stats.zscore(valid_numeric_data, nan_policy='omit'))
outliers_z = (z_scores > 3).any(axis=1)
print("Detected outliers by Z-Score without error:", outliers_z.sum())
except Exception as e:
print("Error recalculating Z-scores:", str(e))
Standard deviations of numeric columns: Price 1.091812e+03 Volume 1.227600e+10 Market Cap 1.304226e+11 year 2.351049e+00 month 3.462797e+00 day 8.810793e+00 weekday 1.999072e+00 price_volume_interaction 3.323323e+13 marketcap_volume_ratio 1.436267e+02 price_change 6.456290e+01 volume_change 5.630875e+09 price_7day_avg 1.089073e+03 volume_7day_avg 1.152035e+10 price_change_pct 5.477677e-02 volume_change_pct 3.052744e+02 market_cap_volume_ratio 1.436267e+02 price_lag1 1.091659e+03 volume_lag1 1.227466e+10 price_ema_short 1.086028e+03 price_ema_long 1.079339e+03 rsi 1.824692e+01 week_of_year 1.514276e+01 quarter 1.123532e+00 days_since_launch 8.560315e+02 cumulative_return 2.482507e+03 cumulative_volume 1.020081e+13 price_ma_ratio 1.544854e-01 normalized_price 2.267520e-01 price_x_volume 3.323323e+13 marketcap_per_volume 1.436267e+02 price_squared 3.854817e+06 price_change_pct_x_volume 1.326119e+09 ma_price_x_ma_volume 3.221227e+13 rsi_x_price_change_pct 3.221176e+00 return_volume_ratio 2.675421e-08 rsi_squared 1.986515e+03 rsi_cubed 1.819791e+05 rsi_squared_x_price 4.542350e+06 is_Q4 4.402400e-01 is_start_of_year 2.769401e-01 Q4_volume_change 1.868019e+09 high_volume_price_change 6.274193e+01 dtype: float64 Detected outliers by Z-Score without error: 613
print(ethereum_data.select_dtypes(include=[np.number]).dtypes)
Price float64 Volume float64 Market Cap float64 year int64 month int64 day int64 weekday int64 price_volume_interaction float64 marketcap_volume_ratio float64 price_change float64 volume_change float64 price_7day_avg float64 volume_7day_avg float64 price_change_pct float64 volume_change_pct float64 market_cap_volume_ratio float64 price_lag1 float64 volume_lag1 float64 price_ema_short float64 price_ema_long float64 rsi float64 week_of_year UInt32 quarter int64 days_since_launch int64 cumulative_return float64 cumulative_volume float64 price_ma_ratio float64 normalized_price float64 price_x_volume float64 marketcap_per_volume float64 price_squared float64 price_change_pct_x_volume float64 ma_price_x_ma_volume float64 rsi_x_price_change_pct float64 return_volume_ratio float64 rsi_squared float64 rsi_cubed float64 rsi_squared_x_price float64 is_Q4 int64 is_start_of_year int64 Q4_volume_change float64 high_volume_price_change float64 dtype: object
ethereum_data.replace([np.inf, -np.inf], np.nan, inplace=True)
ethereum_data.fillna(ethereum_data.median(), inplace=True)
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\3595528556.py:2: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version. C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\3595528556.py:2: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
Outlier Detection and Removal We utilize Z-scores to identify outliers. Here, outliers are defined as observations that are more than 3 standard deviations away from the mean.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize
numeric_cols = ethereum_data.select_dtypes(include=[np.number])
numeric_cols.replace([np.inf, -np.inf], np.nan, inplace=True)
numeric_cols.dropna(inplace=True)
std_devs = numeric_cols.std()
constant_cols = std_devs[std_devs == 0].index
valid_cols = std_devs[std_devs > 0].index
if not constant_cols.empty:
ethereum_data.drop(columns=constant_cols, inplace=True)
print("Dropped constant columns:", constant_cols)
valid_numeric_data = numeric_cols[valid_cols].astype(float)
try:
z_scores = np.abs(stats.zscore(valid_numeric_data, nan_policy='omit'))
outliers_z = (z_scores > 3).any(axis=1)
print("Detected outliers by Z-Score:", outliers_z.sum())
except Exception as e:
print("Error recalculating Z-scores:", str(e))
Detected outliers by Z-Score: 613
Winsorizing Data
To further mitigate the effect of extreme values, we apply winsorization to cap extreme values at both ends of the distribution.
filtered_data = valid_numeric_data[~outliers_z]
print(f"Data after removing outliers has {filtered_data.shape[0]} records out of {valid_numeric_data.shape[0]} original records.")
from scipy.stats.mstats import winsorize
winsorized_data = valid_numeric_data.apply(lambda x: winsorize(x, limits=[0.05, 0.05]))
transformed_data = valid_numeric_data.copy()
transformed_data = transformed_data.apply(lambda x: np.log(x + 1) if np.all(x > 0) else x)
Data after removing outliers has 2351 records out of 2964 original records.
print(filtered_data.describe())
Price Volume Market Cap year month \
count 2351.000000 2.351000e+03 2.351000e+03 2351.000000 2351.000000
mean 747.455660 7.704654e+09 8.652092e+10 2019.236070 7.049341
std 848.214629 8.461588e+09 1.019509e+11 2.347836 3.108005
min 0.439769 1.692088e+05 3.259030e+07 2015.000000 2.000000
25% 145.160874 6.050580e+08 1.555304e+10 2017.000000 4.000000
50% 300.570344 6.067278e+09 2.941204e+10 2019.000000 7.000000
75% 1520.467072 1.157127e+10 1.814142e+11 2021.000000 10.000000
max 3644.405517 4.495794e+10 4.320824e+11 2023.000000 12.000000
day weekday price_volume_interaction \
count 2351.000000 2351.000000 2.351000e+03
mean 15.816674 2.994470 1.007741e+13
std 8.765891 2.005198 1.804724e+13
min 1.000000 0.000000 1.473332e+05
25% 8.000000 1.000000 1.690774e+11
50% 16.000000 3.000000 1.404763e+12
75% 23.000000 5.000000 1.150871e+13
max 31.000000 6.000000 1.089137e+14
marketcap_volume_ratio price_change ... ma_price_x_ma_volume \
count 2351.000000 2351.000000 ... 2.351000e+03
mean 38.811834 0.546849 ... 1.050983e+13
std 54.357068 36.972100 ... 1.859886e+13
min 0.611666 -186.887860 ... 2.370337e+05
25% 4.780971 -5.804794 ... 1.780359e+11
50% 19.145416 -0.006202 ... 1.476849e+12
75% 49.499925 7.184352 ... 1.226244e+13
max 419.478183 180.953324 ... 1.103269e+14
rsi_x_price_change_pct return_volume_ratio rsi_squared \
count 2351.000000 2.351000e+03 2351.000000
mean 0.239944 4.277942e-09 2855.954230
std 2.213093 9.202788e-09 1765.815076
min -8.585443 5.143438e-11 19.410972
25% -0.887301 1.457301e-10 1419.928232
50% -0.007796 3.175254e-10 2534.723946
75% 1.081301 5.009221e-09 4034.377010
max 10.074481 8.490811e-08 8230.783038
rsi_cubed rsi_squared_x_price is_Q4 is_start_of_year \
count 2351.000000 2.351000e+03 2351.000000 2351.0
mean 174155.280081 2.062944e+06 0.266695 0.0
std 153430.319760 2.813506e+06 0.442326 0.0
min 85.520637 2.214744e+02 0.000000 0.0
25% 53505.746367 1.879255e+05 0.000000 0.0
50% 127613.318242 8.361186e+05 0.000000 0.0
75% 256250.520678 2.742979e+06 1.000000 0.0
max 746726.786952 1.556261e+07 1.000000 0.0
Q4_volume_change high_volume_price_change
count 2.351000e+03 2351.000000
mean 1.086705e+06 0.825408
std 8.633646e+08 35.290293
min -5.536621e+09 -186.887860
25% 0.000000e+00 0.000000
50% 0.000000e+00 0.000000
75% -0.000000e+00 0.174759
max 5.550065e+09 180.953324
[8 rows x 42 columns]
This section demonstrates various ways to visualize Ethereum data using Python libraries such as Seaborn, Plotly, and Dash. Each block of code is designed to provide insights into different aspects of Ethereum's price and other attributes over time.
Here, we create a simple line plot using Seaborn to visualize how Ethereum's price has changed over time. This can help in identifying trends or significant changes in the market.
Line Plot: Price over Time
import seaborn as sns
import matplotlib.pyplot as plt
ed['date'] = pd.to_datetime(ed['date'])
ed = ed.sort_values(by='date')
sns.set(style="darkgrid")
plt.figure(figsize=(10, 6))
sns.lineplot(x='date', y='Price', data=ed)
plt.title('Price over Time')
plt.xticks(rotation=45)
plt.show()
Interactive Line Plot: Price over Time with Plotly Using Plotly, we can make the visualization interactive, which is particularly useful for web-based dashboards.
import plotly.express as px
import pandas as pd
ed['date'] = pd.to_datetime(ed['date'])
ed = ed.sort_values(by='date')
fig = px.line(ed, x='date', y='Price', title='Price Trend Over Time',
labels={'date': 'Date', 'Price': 'Price in USD'},
line_shape='linear',
render_mode='svg')
fig.update_layout(
title_font_size=20,
title_x=0.5,
xaxis_title_font_size=14,
yaxis_title_font_size=14,
xaxis_tickangle=-45
)
fig.show()
Interactive Dashboard with Dash
This example sets up a basic Dash application that allows users to select different features of Ethereum data to display on a line chart.
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd
app = dash.Dash(__name__)
app.layout = html.Div([
dcc.Graph(id='time-series-chart'),
html.Label('Select Feature:'),
dcc.Dropdown(
id='feature-selector',
options=[{'label': i, 'value': i} for i in ethereum_data.columns if i not in ['Timestamp', 'year', 'month', 'day', 'weekday', 'hour']],
value='Price'
)
])
@app.callback(
Output('time-series-chart', 'figure'),
[Input('feature-selector', 'value')]
)
def update_graph(selected_feature):
fig = px.line(ethereum_data, x='date', y=selected_feature, title=f'{selected_feature} Over Time')
return fig
if __name__ == '__main__':
app.run_server(debug=True, port=8052)
3D Scatter Plot: Price, Volume, and Market Cap
This Plotly visualization creates a 3D scatter plot to examine the relationships between price, volume, and market cap across different years.
import plotly.express as px
fig = px.scatter_3d(
ethereum_data,
x='Price',
y='Volume',
z='Market Cap',
color='year'
)
fig.update_layout(
title='3D Scatter Plot of Ethereum Price, Volume, and Market Cap',
scene=dict(
xaxis_title='Price USD',
yaxis_title='Volume',
zaxis_title='Market Cap'
)
)
fig.show()
Bar Plot: Average Price per Month
plt.figure(figsize=(10, 6))
sns.barplot(x='month', y='Price', data=ed)
plt.title('Average Price per Month')
plt.show()
Histogram: Price Distribution
plt.figure(figsize=(10, 6))
sns.histplot(ed['Price'], bins=30)
plt.title('Price Distribution')
plt.show()
A seaborn histogram provides insights into the distribution of Ethereum prices, such as the range of prices and any skewness in the data.
Scatter Plot: Price vs. Volume
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Price', y='Volume', data=ed)
plt.title('Price vs. Volume')
plt.show()
This seaborn scatter plot explores the relationship between Ethereum's price and trading volume, potentially revealing correlation patterns.
Box Plot: Price Distribution by Quarter
plt.figure(figsize=(10, 6))
sns.boxplot(x='quarter', y='Price', data=ed)
plt.title('Price Distribution by Quarter')
plt.show()
The seaborn box plot divides the Ethereum price data by quarters, showing how prices vary throughout the year and identifying any outliers.
Correlation Heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(ed[['Price', 'Volume', 'Market Cap', 'price_7day_avg', 'volume_7day_avg']].corr(), annot=True, fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
A seaborn heatmap is used to visualize the correlation between different features like 'Price', 'Volume', 'Market Cap', etc. It helps to identify which features are most strongly related to each other.
Pair Plot
sns.pairplot(ed[['Price', 'Volume', 'Market Cap']])
plt.show()
A seaborn pair plot offers a comprehensive view of bivariate relationships between multiple features ('Price', 'Volume', 'Market Cap'), including scatter plots and histograms.
Violin Plot: Price Distribution by Weekday
This seaborn violin plot shows the distribution of Ethereum prices across different weekdays, helping to identify any weekly patterns.
plt.figure(figsize=(10, 6))
sns.violinplot(x='weekday', y='Price', data=ed)
plt.title('Price Distribution by Weekday')
plt.show()
This seaborn violin plot shows the distribution of Ethereum prices across different weekdays, helping to identify any weekly patterns.
Facet Grid: Price Trend by Quarter
g = sns.FacetGrid(ed, col='quarter', height=4, aspect=1)
g = g.map(plt.plot, 'date', 'Price')
plt.show()
A seaborn FacetGrid is used to create a series of line plots, each representing Ethereum price trends in different quarters, allowing for a comparative analysis across quarters.
Density Plot: Price Distribution
plt.figure(figsize=(10, 6))
sns.kdeplot(ed['Price'], shade=True)
plt.title('Density Plot for Price')
plt.show()
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\2343393537.py:2: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code.
A seaborn KDE plot visualizes the density distribution of Ethereum prices, providing a smoothed representation of the data.
Seasonal Decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
ed['date'] = pd.to_datetime(ed['date'])
ed.set_index('date', inplace=True)
result = seasonal_decompose(ed['Price'], model='additive', period=365)
result.plot()
plt.show()
Using statsmodels, a seasonal decomposition of Ethereum prices is conducted to separate the time series into trend, seasonality, and residuals.
Swarm Plot: Price Distribution by Weekday
plt.figure(figsize=(10, 6))
sns.swarmplot(x='weekday', y='Price', data=ed)
plt.title('Price Distribution by Weekday')
plt.show()
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 38.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 40.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 39.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 40.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 38.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 38.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 39.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 41.8% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 41.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 41.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 40.6% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning: 39.8% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
This seaborn swarm plot offers a detailed view of how Ethereum prices vary on different weekdays. It provides insights into weekly trends and price dispersion.
Pair Grid: Comprehensive Analysis of Price, Volume, and Market Cap
g = sns.PairGrid(ed[['Price', 'Volume', 'Market Cap']])
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot, colors='blue')
g.map_diag(sns.histplot, kde=True)
plt.show()
Candlestick Chart Using Plotly Lastly, we create a candlestick chart to visualize Ethereum price movements in a more detailed and visually appealing manner.
ed['date'] = ed['date'].dt.strftime('%Y-%m-%d')
import plotly.graph_objects as go
fig = go.Figure(data=[go.Candlestick(x=ed['date'],
open=ed['Price'],
high=ed['Price']*1.02, # Simulated high price (2% higher)
low=ed['Price']*0.98, # Simulated low price (2% lower)
close=ed['Price'])])
fig.update_layout(title='Ethereum Price Candlestick Chart', xaxis_title='Date', yaxis_title='Price (USD)')
fig.show()
Stacking Regressor for Ethereum Price Prediction
We employ a stacking approach, combining multiple regression models to improve the prediction accuracy.
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import numpy as np
import pandas as pd
ethereum_data = pd.DataFrame({
'Volume': np.random.rand(100),
'Market Cap': np.random.rand(100),
'year': np.random.randint(2015, 2023, 100),
'month': np.random.randint(1, 13, 100),
'day': np.random.randint(1, 32, 100),
'Price': np.random.rand(100) * 1000
})
X = ethereum_data[['Volume', 'Market Cap', 'year', 'month', 'day']]
y = ethereum_data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Block 2: Configure and Train Stacking Regressor
estimators = [
('lr', LinearRegression()),
('dt', DecisionTreeRegressor(max_depth=5)),
('svr', SVR(kernel='linear', C=0.1))
]
stacking_regressor = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stacking_regressor.fit(X_train, y_train)
print('Stacking Model Score:', stacking_regressor.score(X_test, y_test))
Stacking Model Score: -0.0553694531956892
Optimizing RandomForestRegressor with RandomizedSearchCV
This block configures a RandomizedSearchCV to find the best hyperparameters for a RandomForestRegressor, aiming to improve model performance.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
import numpy as np
param_distributions = {
'n_estimators': np.arange(100, 301, 100),
'max_depth': [None, 10, 20],
'min_samples_split': [5, 10],
'min_samples_leaf': [2, 4],
'max_features': [None, 'sqrt', 'log2'] # Changed 'auto' to None
}
rf = RandomForestRegressor(random_state=42)
rf_random = RandomizedSearchCV(
estimator=rf,
param_distributions=param_distributions,
n_iter=30,
cv=2,
verbose=2,
random_state=42,
n_jobs=-1,
error_score=np.nan # Continue on error with nan score
)
X_train, y_train = np.random.rand(100, 5), np.random.rand(100) # Example data, replace with your actual data
rf_random.fit(X_train, y_train)
Fitting 2 folds for each of 30 candidates, totalling 60 fits
RandomizedSearchCV(cv=2, estimator=RandomForestRegressor(random_state=42),
n_iter=30, n_jobs=-1,
param_distributions={'max_depth': [None, 10, 20],
'max_features': [None, 'sqrt', 'log2'],
'min_samples_leaf': [2, 4],
'min_samples_split': [5, 10],
'n_estimators': array([100, 200, 300])},
random_state=42, verbose=2)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. RandomizedSearchCV(cv=2, estimator=RandomForestRegressor(random_state=42),
n_iter=30, n_jobs=-1,
param_distributions={'max_depth': [None, 10, 20],
'max_features': [None, 'sqrt', 'log2'],
'min_samples_leaf': [2, 4],
'min_samples_split': [5, 10],
'n_estimators': array([100, 200, 300])},
random_state=42, verbose=2)RandomForestRegressor(random_state=42)
RandomForestRegressor(random_state=42)
Block 4: Output Feature Importance from Best Model
X = ethereum_data.drop('Price', axis=1)
y = ethereum_data['Price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = {
'n_estimators': [50, 100, 200],
'max_features': ['sqrt', 'log2'],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
print('Best Parameters:', grid_search.best_params_)
Fitting 3 folds for each of 144 candidates, totalling 432 fits
Best Parameters: {'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 50}
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
ed.fillna(method='ffill', inplace=True)
X = ed[['Volume', 'Market Cap', 'year', 'month', 'day']]
y = ed['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Root Mean Squared Error: {rmse}')
Root Mean Squared Error: 39.47773162973353
ARIMA Model for Time Series Forecasting
To further our analysis, we apply an ARIMA model to forecast Ethereum prices based on historical data.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
ed['date'] = pd.to_datetime(ed[['year', 'month', 'day']])
ed.set_index('date', inplace=True)
ed.index = pd.DatetimeIndex(ed.index).to_period('D')
model_arima = ARIMA(ed['Price'], order=(5,1,0))
model_arima_fit = model_arima.fit()
print(model_arima_fit.summary())
SARIMAX Results
==============================================================================
Dep. Variable: Price No. Observations: 2964
Model: ARIMA(5, 1, 0) Log Likelihood -16537.729
Date: Thu, 02 May 2024 AIC 33087.458
Time: 10:26:32 BIC 33123.422
Sample: 10-21-2015 HQIC 33100.403
- 12-02-2023
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.0654 0.009 -7.557 0.000 -0.082 -0.048
ar.L2 0.0273 0.008 3.597 0.000 0.012 0.042
ar.L3 0.0251 0.008 3.112 0.002 0.009 0.041
ar.L4 0.0286 0.008 3.508 0.000 0.013 0.045
ar.L5 -0.0604 0.007 -8.566 0.000 -0.074 -0.047
sigma2 4132.4564 34.452 119.947 0.000 4064.931 4199.982
===================================================================================
Ljung-Box (L1) (Q): 0.10 Jarque-Bera (JB): 62028.38
Prob(Q): 0.76 Prob(JB): 0.00
Heteroskedasticity (H): 16.84 Skew: -0.97
Prob(H) (two-sided): 0.00 Kurtosis: 25.33
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
This section covers the use of Scikit-learn's train_test_split to divide the data into training and test sets, and LinearRegression for modeling Ethereum prices. The model is trained on selected features like 'Volume', 'Market Cap', and date components, and its performance is evaluated using the root mean squared error (RMSE).
In this section, we apply Ridge Regression to our dataset. Ridge Regression is a type of linear regression that includes a regularization term. This regularization term (L2 penalty) discourages learning overly complex models to prevent overfitting. We scale our features using StandardScaler to normalize the data, ensuring that the model isn't biased towards variables on a larger scale.
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train_scaled, y_train)
y_pred_ridge = ridge_reg.predict(X_test_scaled)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
print(f'Ridge Regression RMSE: {rmse_ridge}')
Ridge Regression RMSE: 39.52609293882694
Here we use Hyperopt, a library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions. We define an objective function to minimize, set up a space of hyperparameters, and use the Tree-structured Parzen Estimator (TPE) method to find the best hyperparameters for a RandomForestRegressor model.
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
def objective(space):
model = RandomForestRegressor(n_estimators=int(space['n_estimators']),
max_depth=int(space['max_depth']),
min_samples_split=int(space['min_samples_split']),
min_samples_leaf=int(space['min_samples_leaf']))
model.fit(X_train, y_train)
pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
return {'loss': mse, 'status': STATUS_OK}
space = {
'n_estimators': hp.quniform('n_estimators', 100, 1000, 100),
'max_depth': hp.quniform('max_depth', 10, 50, 10),
'min_samples_split': hp.choice('min_samples_split', [2, 5, 10]),
'min_samples_leaf': hp.choice('min_samples_leaf', [1, 2, 4])
}
trials = Trials()
best = fmin(fn=objective,
space=space,
algo=tpe.suggest,
max_evals=100,
trials=trials)
print("Best hyperparameters:", best)
100%|██████████| 100/100 [23:23<00:00, 14.03s/trial, best loss: 195.38881312438133]
Best hyperparameters: {'max_depth': 50.0, 'min_samples_leaf': 0, 'min_samples_split': 0, 'n_estimators': 900.0}
The Sharpe ratio is used to measure the performance of an investment compared to a risk-free asset, after adjusting for its risk. It is the average return earned in excess of the risk-free rate per unit of volatility or total risk. Calculating the Sharpe ratio is useful for understanding the return of an investment compared to its risk.
import numpy as np
def sharpe_ratio(returns):
mean_returns = np.mean(returns)
std_returns = np.std(returns)
sharpe_ratio = mean_returns / std_returns * np.sqrt(252) # Assuming daily returns
return sharpe_ratio
ed['returns'] = ed['Price'].pct_change()
print("Sharpe Ratio:", sharpe_ratio(ed['returns'].dropna()))
Sharpe Ratio: 1.2621260038434017
Lasso
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)
y_pred_lasso = lasso_reg.predict(X_test)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
print(f'Lasso Regression RMSE: {rmse_lasso}')
Lasso Regression RMSE: 39.47820594449236
Implementing Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
y_pred_tree = tree_reg.predict(X_test)
rmse_tree = np.sqrt(mean_squared_error(y_test, y_pred_tree))
print(f'Decision Tree Regression RMSE: {rmse_tree}')
Decision Tree Regression RMSE: 14.733249675918618
In this section, we explore the Decision Tree Regression model, known for its ability to capture complex, non-linear relationships in data. After training the model on the Ethereum dataset, we evaluate its performance using the RMSE metric, providing insights into its effectiveness compared to simpler models.
Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print(f'Random Forest Regression RMSE: {rmse_rf}')
Random Forest Regression RMSE: 14.261140429352936
This section focuses on Random Forest Regression, an advanced ensemble method that combines multiple decision trees to enhance predictive accuracy and robustness. After fitting the model to our Ethereum dataset, we assess its performance using the RMSE value, comparing it against previous models to gauge its relative effectiveness. Blow is a Residual Plot to visualise the distribution of errors:
import matplotlib.pyplot as plt
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
residuals = y_test - y_pred_rf
plt.scatter(y_pred_rf, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for Random Forest Regression')
plt.axhline(y=0, color='r', linestyle='--')
plt.show()
SVR
from sklearn.svm import SVR
svr_reg = SVR(kernel='rbf')
svr_reg.fit(X_train, y_train)
y_pred_svr = svr_reg.predict(X_test)
rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred_svr))
print(f'Support Vector Regression RMSE: {rmse_svr}')
Support Vector Regression RMSE: 707.9913315066016
In this section, we implement Support Vector Regression (SVR), a versatile machine learning algorithm, on the Ethereum dataset. SVR is known for its effectiveness in handling non-linear relationships. We employ the Radial Basis Function (RBF) kernel and evaluate the model's performance using the RMSE metric, providing insights into its predictive accuracy.
Gradient Boosting Regression Implementation
from sklearn.ensemble import GradientBoostingRegressor
gb_reg = GradientBoostingRegressor(n_estimators=100)
gb_reg.fit(X_train, y_train)
y_pred_gb = gb_reg.predict(X_test)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
print(f'Gradient Boosting Regression RMSE: {rmse_gb}')
Gradient Boosting Regression RMSE: 15.29218574410884
This section focuses on the implementation of Gradient Boosting Regression, a powerful ensemble learning technique. Gradient Boosting Regression builds an additive model in a forward stage-wise fashion, allowing for the optimization of arbitrary differentiable loss functions. The model is trained on Ethereum dataset features with 100 estimators, and its effectiveness is evaluated using the Root Mean Squared Error (RMSE) metric.
Model Performance Comparison
print(f'Ridge Regression RMSE: {rmse_ridge}')
print(f'Lasso Regression RMSE: {rmse_lasso}')
print(f'Decision Tree Regression RMSE: {rmse_tree}')
print(f'Random Forest Regression RMSE: {rmse_rf}')
print(f'Support Vector Regression RMSE: {rmse_svr}')
print(f'Gradient Boosting Regression RMSE: {rmse_gb}')
Ridge Regression RMSE: 39.52609293882694 Lasso Regression RMSE: 39.47820594449236 Decision Tree Regression RMSE: 14.733249675918618 Random Forest Regression RMSE: 14.261140429352936 Support Vector Regression RMSE: 707.9913315066016 Gradient Boosting Regression RMSE: 15.29218574410884
In this section, the performance of all implemented models is compared using the Root Mean Squared Error (RMSE) metric. This comparison is crucial for determining the most effective model for predicting Ethereum prices. The RMSE values for Ridge Regression, Lasso Regression, Decision Tree Regression, Random Forest Regression, Support Vector Regression, and Gradient Boosting Regression are displayed, providing insights into each model's accuracy and predictive power.
Custom Accuracy Metric
def custom_accuracy(y_true, y_pred, threshold=0.01):
"""
Calculate the percentage of predictions within a certain threshold.
:param y_true: Actual values
:param y_pred: Predicted values
:param threshold: Threshold for considering a prediction accurate (default 1%)
:return: Accuracy as a percentage
"""
within_threshold = np.abs(y_true - y_pred) <= threshold * np.abs(y_true)
accuracy = np.mean(within_threshold)
return accuracy * 100
accuracy = custom_accuracy(y_test, y_pred_rf)
print(f'Custom Accuracy: {accuracy:.2f}%')
Custom Accuracy: 69.48%
This section introduces a custom accuracy metric designed to evaluate the model's predictions based on a specified threshold. The custom accuracy function, named custom_accuracy, calculates the percentage of predictions that fall within a certain margin (threshold) of the actual values. This metric is particularly useful for understanding the practical effectiveness of the model in scenarios where slight deviations from the actual values are acceptable.
We begin by performing cross-validation on a Random Forest Regressor to evaluate its performance more robustly. This is crucial to ensure that our model is not only fitting to a particular subset of the data. Here, we use the cross_val_score function with 5 folds, which will provide us with insights into the model's stability and performance across different subsets of the data. We use the negative mean squared error as the scoring method and calculate the RMSE (Root Mean Squared Error) for each fold to get a sense of the average error magnitude.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=100)
cv_scores = cross_val_score(rf_reg, X, y, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-cv_scores)
print("RMSE scores for each fold:", rmse_scores)
print(f"Mean RMSE: {np.mean(rmse_scores)}")
print(f"Standard Deviation of RMSE: {np.std(rmse_scores)}")
RMSE scores for each fold: [ 71.66345284 71.33436172 17.16693587 481.22333446 87.74440581] Mean RMSE: 145.82649813889662 Standard Deviation of RMSE: 169.39131376842155
Next, we employ time series cross-validation to evaluate the Random Forest model. This method is particularly useful when dealing with time series data, as it respects the temporal order of observations. We use the TimeSeriesSplit from sklearn, specifying 5 splits, and print out the score for each fold to observe how the model performs over time.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestRegressor(**grid_search.best_params_)
for train_index, test_index in tscv.split(X):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
model.fit(X_train, y_train)
print('Fold Score:', model.score(X_test, y_test))
Fold Score: -1.9445148450106546 Fold Score: -4.397333022336436 Fold Score: 0.6594040240142576 Fold Score: -2.519497914536573 Fold Score: -2.1244451417978993
We enhance our model tuning by conducting a Randomized Search for the best hyperparameters. Randomized Search offers a probabilistic approach that searches the parameter space more efficiently than GridSearchCV. Here, we define a range of values for hyperparameters and use RandomizedSearchCV to find the best combination based on the specified distribution.
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_random = {
'n_estimators': [100, 200, 300, 400, 500],
'max_features': ['sqrt', 'log2'],
'max_depth': [10, 20, 30, 40, 50, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
rf_random = RandomForestRegressor()
random_search = RandomizedSearchCV(estimator=rf_random, param_distributions=param_random,
n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)
print("Randomized Search Best Parameters:", random_search.best_params_)
max_depth = random_search.best_params_['max_depth']
if max_depth is None:
max_depth_values = [None]
else:
max_depth_values = [max_depth - 10 if max_depth > 10 else 5, max_depth, max_depth + 10]
param_grid = {
'n_estimators': [random_search.best_params_['n_estimators']],
'max_features': [random_search.best_params_['max_features']],
'max_depth': max_depth_values,
'min_samples_split': [random_search.best_params_['min_samples_split']],
'min_samples_leaf': [random_search.best_params_['min_samples_leaf']]
}
grid_search = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
print('Grid Search Best Parameters:', grid_search.best_params_)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
Randomized Search Best Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'bootstrap': False}
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Grid Search Best Parameters: {'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
We enhance our model tuning by conducting a Randomized Search for the best hyperparameters. Randomized Search offers a probabilistic approach that searches the parameter space more efficiently than GridSearchCV. Here, we define a range of values for hyperparameters and use RandomizedSearchCV to find the best combination based on the specified distribution.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = {
'n_estimators': [100, 200, 300, 400, 500],
'max_features': ['sqrt', 'log2'], # Update 'max_features' here
'max_depth': [10, 20, 30, 40, 50, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_grid,
n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1,
error_score='raise')
rf_random.fit(X_train, y_train)
print("Best Parameters:", rf_random.best_params_)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best Parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 50, 'bootstrap': False}
After finding the best hyperparameters, we train a Random Forest Regressor using these parameters and examine the feature importances. This allows us to see which features are most influential in predicting the Ethereum price.
from sklearn.ensemble import RandomForestRegressor
best_params = {
'n_estimators': rf_random.best_params_['n_estimators'],
'max_features': rf_random.best_params_['max_features'],
'max_depth': rf_random.best_params_['max_depth'],
'min_samples_split': rf_random.best_params_['min_samples_split'],
'min_samples_leaf': rf_random.best_params_['min_samples_leaf'],
'bootstrap': rf_random.best_params_['bootstrap']
}
final_rf_reg = RandomForestRegressor(**best_params)
final_rf_reg.fit(X_train, y_train)
y_pred_rf_final = final_rf_reg.predict(X_test)
rmse_rf_final = np.sqrt(mean_squared_error(y_test, y_pred_rf_final))
print(f'Random Forest Regression RMSE (with best hyperparameters): {rmse_rf_final}')
Random Forest Regression RMSE (with best hyperparameters): 289.4496998373119
In this section, we utilize TimeSeriesSplit from sklearn to perform time series cross-validation. This is particularly suitable for time series data to validate the model in a way that respects the temporal order of observations. We use the best estimator from a previous RandomizedSearchCV, and calculate the negative mean squared error across each fold. We then compute the root mean squared error (RMSE) for each split to assess the model's performance over time.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(rf_random.best_estimator_, X_train, y_train, cv=tscv, scoring='neg_mean_squared_error')
print("Time-series CV scores:", np.sqrt(-scores))
Time-series CV scores: [376.93872734 193.41857629 16.38233049 895.52961301 994.12721737]
After cross-validation, we perform a backtest by splitting the data at a certain point in time (70% of the data for training and the rest for testing). This method is commonly used in financial modeling to simulate the model's performance on unseen data as if it were being used in practice.
split_index = int(len(X) * 0.7)
X_train_bt, X_test_bt = X[:split_index], X[split_index:]
y_train_bt, y_test_bt = y[:split_index], y[split_index:]
rf_random.best_estimator_.fit(X_train_bt, y_train_bt)
predictions = rf_random.best_estimator_.predict(X_test_bt)
mse = mean_squared_error(y_test_bt, predictions)
print("Backtest MSE:", mse)
Backtest MSE: 318962.1721488143
Finally, we visualize the actual vs. predicted prices using Plotly, a powerful library for creating interactive charts. This visualization helps in understanding the accuracy of the predictions in a more intuitive and graphical format.
import plotly.graph_objs as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(y_test_bt)), y=y_test_bt, mode='lines', name='Actual'))
fig.add_trace(go.Scatter(x=np.arange(len(y_test_bt)), y=predictions, mode='lines', name='Predicted'))
fig.update_layout(title='Actual vs Predicted Prices', xaxis_title='Index', yaxis_title='Price')
fig.show()
This section initializes machine learning models and applies data imputation to handle missing values in the dataset. We use SimpleImputer to replace missing values with the median of each column. We then train two different models: RandomForestRegressor and LinearRegression, to predict our target variable. The root mean squared error (RMSE) is calculated for each model to evaluate their performance.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
models = {
'RandomForestRegressor': RandomForestRegressor(),
'LinearRegression': LinearRegression()
}
from sklearn.impute import SimpleImputer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
results = {}
importances = {}
for name, model in models.items():
pipeline = make_pipeline(SimpleImputer(strategy='median'), model)
pipeline.fit(X_train_imputed, y_train)
y_pred = pipeline.predict(X_test_imputed)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
results[name] = rmse
print(f'{name} RMSE: {rmse:.4f}')
if hasattr(model, 'feature_importances_'):
importances[name] = model.feature_importances_
if importances:
for name, importance in importances.items():
features = X_train.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importance}).sort_values(by='Importance', ascending=False)
print(f'\n{name} Feature Importances:')
print(importance_df)
RandomForestRegressor RMSE: 14.2672
LinearRegression RMSE: 39.4777
RandomForestRegressor Feature Importances:
Feature Importance
1 Market Cap 0.999622
2 year 0.000153
0 Volume 0.000147
3 month 0.000041
4 day 0.000036
After training the models, we print out the RMSE results for each. This step helps in comparing the effectiveness of the RandomForestRegressor with LinearRegression based on their RMSE values.
rmse_results = {
'LSTM 50x1': 0.02627,
'LSTM 50x2': 0.02557,
'LSTM 50x3': 0.04761,
'LSTM 100x1': 0.02177,
'LSTM 100x2': 0.02226,
'LSTM 100x3': 0.04501,
'Linear Regression': 39.4777,
'Ridge Regression': 39.4777,
'Lasso Regression': 39.4782,
'Decision Tree': 17.4592,
'Random Forest': 14.3826,
'Support Vector Regression (SVR)': 707.9913,
'Gradient Boosting': 15.2922,
'RF Cross-Validation': 144.3496,
'RF Best Params': 378.5620
}
# Printing the RMSE results
print("RMSE Results for Various Models:")
for model, rmse in rmse_results.items():
print(f'{model}: {rmse:.4f}')
RMSE Results for Various Models: LSTM 50x1: 0.0263 LSTM 50x2: 0.0256 LSTM 50x3: 0.0476 LSTM 100x1: 0.0218 LSTM 100x2: 0.0223 LSTM 100x3: 0.0450 Linear Regression: 39.4777 Ridge Regression: 39.4777 Lasso Regression: 39.4782 Decision Tree: 17.4592 Random Forest: 14.3826 Support Vector Regression (SVR): 707.9913 Gradient Boosting: 15.2922 RF Cross-Validation: 144.3496 RF Best Params: 378.5620
We also include a function to evaluate different metrics of the model's performance, such as RMSE, Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), R-squared, and accuracy within a certain threshold. This function helps in a comprehensive assessment of model performance.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def evaluate_model(y_true, y_pred):
print("Shape of y_true before adjustment:", y_true.shape)
print("Shape of y_pred before adjustment:", y_pred.shape)
if y_true.shape[0] != y_pred.shape[0]:
min_len = min(y_true.shape[0], y_pred.shape[0])
y_true = y_true[:min_len]
y_pred = y_pred[:min_len]
if y_pred.ndim > 1:
y_pred = y_pred.flatten()
print("Shape of y_true after adjustment:", y_true.shape)
print("Shape of y_pred after adjustment:", y_pred.shape)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
r2 = r2_score(y_true, y_pred)
threshold = 0.05 # 5% threshold
within_threshold = np.abs((y_true - y_pred) / y_true) <= threshold
accuracy = np.mean(within_threshold) * 100
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"MAPE: {mape:.2f}%")
print(f"R-squared: {r2:.4f}")
print(f"Accuracy (within {threshold*100}%): {accuracy:.2f}%")
evaluate_model(y_test, test_predict)
Shape of y_true before adjustment: (593,) Shape of y_pred before adjustment: (591, 1) Shape of y_true after adjustment: (591,) Shape of y_pred after adjustment: (591,) RMSE: 1406.6387 MAE: 1243.4258 MAPE: 8752.02% R-squared: -0.7480 Accuracy (within 5.0%): 3.72%
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
param_grid = {
'n_estimators': [100, 200],
'max_features': ['sqrt'], # Changed 'auto' to 'sqrt'
'max_depth': [None, 10, 20],
'min_samples_split': [2, 10],
'min_samples_leaf': [1, 4]
}
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=10, cv=5, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
random_search.fit(X_train, y_train)
print("Best parameters:", random_search.best_params_)
best_rf = random_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
print(f"Best RMSE: {rmse_best:.4f}")
RMSE: 14.0666
MAE: 6.1979
R-squared: 0.9998
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 10}
Best RMSE: 28.9350
This section covers the initial setup, training, and evaluation of a RandomForestRegressor. We train the model on the training set and then make predictions on the test set. The model's performance is evaluated using the root mean squared error (RMSE), mean absolute error (MAE), and the R-squared score, which provides an indication of the goodness of fit of the predictions.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
import numpy as np
# Prepare the data and split it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# Fit the model
try:
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
except Exception as e:
print("An error occurred during model training or prediction:")
print(e)
raise
# Calculate performance metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Model Performance Metrics:")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
# Define parameter grid
param_grid = {
'n_estimators': [100, 200],
'max_features': ['sqrt'], # Corrected from 'auto' to 'sqrt'
'max_depth': [None, 10, 20],
'min_samples_split': [2, 10],
'min_samples_leaf': [1, 4]
}
# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=10, cv=5, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
random_search.fit(X_train, y_train)
# Output the best parameters and the best RMSE
print("Best parameters found by RandomizedSearchCV:")
print(random_search.best_params_)
best_rf = random_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
print(f"Best RMSE from RandomizedSearchCV: {rmse_best:.4f}")
Model Performance Metrics:
Root Mean Squared Error: 14.0666
Mean Absolute Error: 6.1979
R-squared: 0.9998
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found by RandomizedSearchCV:
{'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None}
Best RMSE from RandomizedSearchCV: 24.2930
To visually assess the model's performance, we plot the actual vs. predicted values. This plot helps identify how well the predicted values match the actual values and highlights any potential areas where the model may be underperforming.
# Plot Actual vs Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_best, alpha=0.75, color='red', edgecolors='b')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted Values')
plt.show()
After identifying the best parameters from the RandomizedSearchCV, we retrain the RandomForestRegressor with these optimized parameters and evaluate its performance again using the RMSE metric.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
rf_reg = RandomForestRegressor(
n_estimators=best_rf.get_params()['n_estimators'],
max_features=best_rf.get_params()['max_features'],
max_depth=best_rf.get_params()['max_depth'],
min_samples_split=best_rf.get_params()['min_samples_split'],
min_samples_leaf=best_rf.get_params()['min_samples_leaf'],
random_state=42
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print(f'Random Forest Regression RMSE: {rmse_rf}')
Random Forest Regression RMSE: 24.292987762035448
References
Python Software Foundation. (2023). Python 3.10.4 documentation. Available at: https://docs.python.org/3/. [Accessed 8 December 2023].
Pandas Development Team. (2023). pandas: powerful Python data analysis toolkit. Available at: https://pandas.pydata.org/pandas-docs/stable/. [Accessed 8 December 2023].
Harris, C.R., Millman, K.J., van der Walt, S.J. et al. (2020). Array programming with NumPy. Available at: https://numpy.org/doc/stable/. [Accessed 8 December 2023].
Hunter, J.D., Dale, D., Firing, E., Droettboom, M. (2023). Matplotlib: Visualization with Python. Available at: https://matplotlib.org/stable/users/index.html. [Accessed 8 December 2023].
Waskom, M.L. (2023). Seaborn: statistical data visualization. Available at: https://seaborn.pydata.org/. [Accessed 8 December 2023].
Virtanen, P., Gommers, R., Oliphant, T.E., et al. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Available at: https://docs.scipy.org/doc/scipy/reference/. [Accessed 8 December 2023].
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine Learning in Python. Available at: https://scikit-learn.org/stable/. [Accessed 8 December 2023].
Seabold, S., Perktold, J. (2010). Statsmodels: Econometric and Statistical Modeling with Python. Available at: https://www.statsmodels.org/stable/index.html. [Accessed 8 December 2023].